Create a data visualisation showing average rating and proportion of cocoa percent (% chocolate) greater than or equal to 70% by top 15 company location.
Methods to visualise statistical uncertainty were employed.
To address the requirements of the task, the chocolate.csv data set was used. The DT package was installed to display an interactive data table to augment the graph. The crosstalk package was installed to link multiple HTML widgets (e.g. a graph and a data table) within RMarkdown.
The data preparation was done as follows:
Code chunk:
library(tidyverse)  # provides read_csv(), the %>% pipe and the dplyr verbs
choc <- read_csv("data/chocolate.csv")
# Drop the % symbol in the cocoa percent column and convert the data type to numeric
choc$cocoa_percent <- as.numeric(gsub("%", "", as.character(choc$cocoa_percent), fixed = TRUE))
# Convert cocoa_percent into a decimal fraction for easier manipulation
choc$cocoa_percent <- 0.01 * choc$cocoa_percent
# Keep only the columns needed for the analysis
choc_loc <- choc %>%
  select(company_location, rating, cocoa_percent)
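To make the conversion step concrete, here is a minimal sketch on a few made-up percent strings. The values and the names `raw_pct`, `pct_num` and `pct_frac` are illustrative assumptions, not taken from chocolate.csv.

```r
# Hypothetical sample values, for illustration only
raw_pct <- c("70%", "85.5%", "64%")

# Strip the literal "%" and convert the remaining text to numeric
pct_num <- as.numeric(sub("%", "", raw_pct, fixed = TRUE))

# Express the percentage as a fraction between 0 and 1
pct_frac <- pct_num / 100

print(pct_frac)
```

Using `fixed = TRUE` treats "%" as a literal character rather than a regular expression, which keeps the intent explicit.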
Group the data by company location, creating a new summary table of frequency count, average rating score and standard deviation.
Create a new variable, standard error, calculated using the formula standard error = standard deviation/sqrt(sample size - 1).
Slice out the top 15 locations by frequency count.
Format the values by rounding to 2 decimal places.
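The four steps above can be sketched as follows. This is a minimal sketch on a toy data frame, not the original chunk: the names `avgR`, `nR`, `meanR` and `se` follow the plotting code used later, while `sdR` and all data values are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for choc_loc (hypothetical values, for illustration only)
choc_loc <- tibble(
  company_location = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  rating           = c(3.0, 3.5, 3.25, 2.75, 3.25, 3.5, 3.0, 3.75, 3.25),
  cocoa_percent    = c(0.70, 0.72, 0.65, 0.80, 0.75, 0.70, 0.68, 0.74, 0.71)
)

avgR <- choc_loc %>%
  group_by(company_location) %>%
  summarise(
    nR    = n(),           # frequency count
    meanR = mean(rating),  # average rating
    sdR   = sd(rating)     # standard deviation (column name is an assumption)
  ) %>%
  mutate(se = sdR / sqrt(nR - 1)) %>%  # standard error as defined in the text
  slice_max(nR, n = 15) %>%            # top 15 locations by frequency count
  mutate(across(c(meanR, sdR, se), ~ round(.x, 2)))  # round to 2 decimal places

print(avgR)
```

With only three toy locations, `slice_max()` keeps all of them and returns them in descending order of `nR`.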
Code chunk:
# Summarise cocoa percent by location, keeping only bars at or above 70% cocoa
avgPct <- choc_loc %>%
  filter(cocoa_percent >= 0.7) %>%
  group_by(company_location) %>%
  summarise(nP = n(), meanP = mean(cocoa_percent)) %>%
  mutate(seP = sqrt((meanP * (1 - meanP)) / nP)) %>%
  slice_max(nP, n = 15) %>%
  mutate(meanP100 = meanP * 100)
# Reorder columns so that meanP (hidden in the linked table) comes last
avgPct <- avgPct[, c("company_location", "nP", "meanP100", "seP", "meanP")]
avgPct$meanP100 <- round(avgPct$meanP100, digits = 1)
avgPct$seP <- round(avgPct$seP, digits = 3)
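The expression sqrt(p(1 - p)/n) used for `seP` is the usual standard error of a binomial proportion. For reference, here is a minimal sketch of how it applies when p is the share of bars at or above 70% cocoa within each location, which is another way the task statement could be read. The data frame `toy` and the names `prop70`, `p` and `se_p` are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
toy <- tibble(
  company_location = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  cocoa_percent    = c(0.70, 0.72, 0.65, 0.80, 0.60, 0.75, 0.68, 0.64, 0.71)
)

prop70 <- toy %>%
  group_by(company_location) %>%
  summarise(
    n = n(),                        # bars per location
    p = mean(cocoa_percent >= 0.7)  # share of bars at or above 70% cocoa
  ) %>%
  mutate(se_p = sqrt(p * (1 - p) / n))  # binomial standard error of the share

print(prop70)
```

Taking the mean of a logical vector gives the proportion of TRUE values, so no separate counting step is needed.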
To visualise the uncertainties, ggplotly was used with the following customisations:
In addition, a linked data table was created using the crosstalk method to allow users to:
- View the full details of the frequency, mean, standard deviation and standard error
- Sort any of these columns by clicking on the button at the top of each column
As the two components are linked, selecting any row in the table (e.g. the row with the highest rating or lowest cocoa percentage) will highlight the corresponding element on the confidence interval graph.
The code chunk and visualisation for Average Rating are as follows:
# Linked chart and table of ratings
library(plotly)     # ggplotly()
library(crosstalk)  # SharedData and bscols()
# Wrap data frame in SharedData
shared_rating <- SharedData$new(avgR)
# Render the graph above the linked table
bscols(widths = c(12, 12),
  ggplotly((ggplot(shared_rating) +
    geom_errorbar(
      aes(x = reorder(company_location, -nR),
          ymin = meanR - 1.98 * se,
          ymax = meanR + 1.98 * se),
      width = 0.2,
      colour = "black",
      alpha = 0.9,
      size = 0.5) +
    geom_point(
      aes(x = company_location,
          y = meanR,
          text = paste("Location:", company_location,
                       "<br>Avg. Rating:", meanR,
                       "<br>Max:", round(meanR + 1.98 * se, digits = 2),
                       " Min:", round(meanR - 1.98 * se, digits = 2))),
      stat = "identity",
      color = "red",
      size = 1.5,
      alpha = 1) +
    xlab("Company Location") +
    ylab("Average Ratings") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 1)) +
    ggtitle("95% Confidence Interval of Average Rating by Top 15 Locations")),
    tooltip = "text"),
  DT::datatable(shared_rating, rownames = FALSE,
                options = list(pageLength = 5, scrollX = TRUE),
                colnames = c("No. of Locations", "Average Rating", "Std Dev", "Std Error"))
)
The code chunk and visualisation for Cocoa Percentage are as follows:
# Linked chart and table of cocoa percent
# Wrap data frame in SharedData
shared_pct <- SharedData$new(avgPct)
# Render the graph above the linked table
bscols(widths = c(12, 12),
  ggplotly((ggplot(shared_pct) +
    geom_errorbar(
      aes(x = reorder(company_location, -nP),
          ymin = meanP - 1.98 * seP,
          ymax = meanP + 1.98 * seP),
      width = 0.2,
      colour = "black",
      alpha = 0.9,
      size = 0.5) +
    geom_point(
      aes(x = company_location,
          y = meanP,
          text = paste("Location:", company_location,
                       "<br>Cocoa:", meanP100,
                       "%<br>Max:", round((meanP + 1.98 * seP) * 100, digits = 1),
                       "% Min:", round((meanP - 1.98 * seP) * 100, digits = 1), "%")),
      stat = "identity",
      color = "red",
      size = 1.5,
      alpha = 1) +
    xlab("Company Location") +
    ylab("Cocoa Percentage (%)") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 1)) +
    ggtitle("95% Confidence Interval of Cocoa Percentage 70% and above\nby Top 15 Locations")),
    tooltip = "text"),
  DT::datatable(shared_pct, rownames = FALSE,
                options = list(pageLength = 5, scrollX = TRUE,
                               columnDefs = list(list(visible = FALSE, targets = c(4)))),
                colnames = c("No. of Locations", "Avg Percentage", "Pct Std Error", "Pct"))
)
U.S.A. was the top company location with 1136 locations for Average Rating and 974 for Cocoa Percentage, followed far behind by Canada (177, 163) and France (176, 130).
The highest chocolate rating was from Australia at 3.36. Australia also clocked the second lowest cocoa percentage of 71.6% in the list.
Incidentally, Ecuador scored the lowest average rating at 3.04, with the highest cocoa percentage at 76.7%.
The only Asian country in the top 15 list was Japan, with an average rating of 3.13 and cocoa percentage of 71.7%.
The lowest cocoa percentage in the top 15 came from Denmark, at 71.1%.
For both the Average Rating and Cocoa Percentage charts, the USA had the narrowest 95% confidence intervals. This means we can be 95% confident that the true Average Rating of chocolate from the USA locations falls between 3.17 and 3.21, and that the Cocoa Percentage falls between 69.9% and 75.4%.
Visually, it is obvious that the confidence intervals widen for countries with a smaller number of locations (smaller sample sizes). For example, for Venezuela, which has 31 and 25 locations in the data set for Average Rating and Cocoa Percentage respectively, the 95% confidence intervals cover a much wider range, at 2.95-3.27 and 54.8%-90.1%.
This is because the standard error (and hence the width of the confidence interval) increases as the sample size decreases. A larger sample represents the population more closely, so sample means tend to cluster more tightly around the true population mean, i.e. the confidence interval narrows for the same confidence level.
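The effect of sample size can be illustrated with a short calculation. This is a sketch under an assumed common standard deviation of 0.45, which is not a figure from the data set. The sample sizes are the Average Rating counts quoted above, and the standard error formula and the 1.98 multiplier are the ones used in this post.

```r
# Assumed common standard deviation, chosen only for illustration
sd_rating <- 0.45

# Sample sizes quoted in the text for Average Rating
n <- c(Venezuela = 31, USA = 1136)

# Standard error as defined earlier: sd / sqrt(n - 1)
se <- sd_rating / sqrt(n - 1)

# Half-width of the interval, using the same 1.98 multiplier as the charts
half_width <- 1.98 * se

print(round(half_width, 3))
```

Holding the spread fixed, the interval for the smaller sample is several times wider, purely because of the difference in n.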
Often a sample size is considered “large enough” if it’s greater than or equal to 30, but this number can vary a bit based on the underlying shape of the population distribution.
The error bars help to show:
- How precisely the mean value is estimated for each country. In the Average Rating chart, the small error bar for the USA reflects both its large sample and how closely its ratings cluster around the mean. This contrasts with countries such as Belgium, whose larger error bar indicates that the mean is estimated less precisely.
- How reliable the mean value is as a representative number for the data set. In other words, how accurately the mean value represents the data (small error bar = more reliable, larger error bar = less reliable). It is important to note that a larger error bar does not indicate that the data are invalid. Real-world measurements are often highly variable.
- The likelihood of there being a significant difference between different countries' data. A "significant difference" means that the results seen are most likely not due to chance or sampling error. In any experiment or observation that involves sampling from a population, there is always the possibility that an observed effect occurred due to sampling error alone. But if a result is "significant", the investigator may conclude that the observed effect reflects the characteristics of the population rather than just sampling error or chance. To this end, the error bars on a graph can be used to get a sense of whether or not a difference is significant. Look at the overlap between the error bars in the figure below:
When the error bars do not overlap, as in the case of the USA and Canada, it is a clue that the difference may be significant, but you cannot be sure. You must actually perform a statistical test to draw a conclusion.
When the error bars overlap a lot, as with Canada and France, it is a clue that the difference may not be statistically significant. Similarly, a statistical test must be performed to draw a conclusion.
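As an illustration of what such a statistical test could look like, the sketch below runs Welch's two-sample t-test with base R's `t.test()` on simulated ratings. The two samples are randomly generated stand-ins for two locations and are not drawn from the chocolate data. The choice of Welch's test is an assumption, and other tests may suit the real data better.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated ratings for two hypothetical locations (not the real data)
ratings_a <- rnorm(40, mean = 3.2, sd = 0.4)
ratings_b <- rnorm(35, mean = 3.1, sd = 0.5)

# Welch two-sample t-test (does not assume equal variances)
test_result <- t.test(ratings_a, ratings_b)

print(test_result$p.value)   # p-value for the difference in means
print(test_result$conf.int)  # 95% confidence interval for the difference
```

A small p-value (commonly below 0.05) would suggest that the difference in mean ratings is unlikely to be due to sampling error alone.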
One unexpected challenge encountered in this exercise was that using crosstalk disrupted the default CSS framework of distill. As a result, the usual text formatting and sizing went haywire.
Further research showed that this was due to a Bootstrap HTML dependency attached to filter_select(), filter_checkbox(), and bscols(). This caused crosstalk to degrade the overall look when used in a non-Bootstrap CSS framework like distill.
RStudio released a newer version of crosstalk in 2021, and the issue was resolved by installing the latest version of the package.